CNN-ViT Supported Weakly-Supervised Video Segment Level Anomaly Detection

by Md. Haidar Sharif *, Lei Jiao and Christian W. Omlin

Department of ICT, University of Agder, 4630 Kristiansand, Norway
* Author to whom correspondence should be addressed.

Sensors 2023, 23(18), 7734; https://doi.org/10.3390/s23187734

Submission received: 1 August 2023 / Revised: 1 September 2023 / Accepted: 4 September 2023 / Published: 7 September 2023

(This article belongs to the Special Issue Sensor-Based Object Detection and Recognition in Intelligent Surveillance Systems)

Academic Editor: Amir Atapour-Abarghouei

Abstract: Video anomaly event detection (VAED) is one of the key technologies in computer vision for smart surveillance systems. With the advent of deep learning, contemporary advances in VAED have achieved substantial success. Recently, weakly supervised VAED (WVAED) has become a popular technical route of VAED research. WVAED methods do not depend on a supplementary self-supervised surrogate task, yet they can estimate anomaly scores directly. However, the performance of WVAED methods depends on pretrained feature extractors. In this paper, we first take advantage of two kinds of pretrained feature extractors, CNN (e.g., C3D and I3D) and ViT (e.g., CLIP), to extract discriminative representations effectively. We then consider long-range and short-range temporal dependencies and put forward video snippets of interest by leveraging our proposed temporal self-attention network (TSAN). We design a multiple instance learning (MIL)-based generalized architecture named CNN-ViT-TSAN, which uses CNN- and/or ViT-extracted features together with the TSAN to specify a series of models for the WVAED problem. Experimental results on publicly available popular crowd datasets demonstrated the effectiveness of our CNN-ViT-TSAN.

Keywords: attention; convolutional neural network (CNN); Mahalanobis distance; multiple instance learning (MIL); vision transformer (ViT); weakly supervised video anomaly event detection

1. Introduction

Fully supervised, unsupervised, and weakly supervised are the three dominant paradigms in video anomaly event detection (VAED). The fully supervised paradigm mostly gives high performance [1]. Nevertheless, frame-level normal or abnormal annotations in the training data are essential, which requires the video annotators to localize and label abnormalities in videos. As abnormalities can take place at any time, nearly all frames need to be inspected by the annotators. Accumulating a fully annotated large-scale dataset for supervised VAED is therefore a manual and time-consuming process.

In the unsupervised paradigm, the models are trained on samples of normal events only, under the common assumption that unseen anomaly videos will have high reconstruction errors [2,3,4]. Unfortunately, the performance of unsupervised VAED is commonly inferior, due to its lack of prior understanding of anomalies, as well as its inability to capture all kinds of normality variants [5]. The weakly supervised approaches are thus considered to be the most practical paradigm, over both the unsupervised and supervised paradigms, due to their competitive performance as well as annotation cost-effectiveness, since they use video-level labels to lower the cost of laborious fine-grained annotations [6,7].

Nowadays, WVAED has become an established technical route of VAED research [6,7,8,9,10,11,12,13,14,15,16]. The WVAED problem is mainly regarded as an MIL (multiple instance learning) problem [8]. In general, WVAED models directly output anomaly scores by comparing the spatiotemporal features of normal and abnormal events through MIL. MIL pertains to training data organized in sets, called positive and negative bags. A video in MIL is regarded as a bag holding multiple instances, where each instance belongs to a video snippet. A negative bag contains only normal snippets, whereas a positive one contains both normal and abnormal snippets, without any temporal information about the beginning and end of abnormal events.
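For readers who prefer code, an MIL bag amounts to nothing more than the snippet features of one video together with its video-level label; the minimal Python sketch below is illustrative only and is not taken from the authors' implementation.

```python
from dataclasses import dataclass
import torch

@dataclass
class Bag:
    """One video treated as an MIL bag."""
    snippets: torch.Tensor  # (T, D) features of the T snippets (instances)
    label: int              # 1 = positive bag (holds >= 1 abnormal snippet), 0 = negative bag

negative_bag = Bag(snippets=torch.randn(32, 512), label=0)  # all snippets normal
positive_bag = Bag(snippets=torch.randn(32, 512), label=1)  # contains some abnormal snippets
```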
The standard MIL assumes that all negative bags accommodate only negative snippets, and that positive bags carry no fewer than one positive snippet. Supervision is provided solely for complete sets, and the individual labels of the snippets contained in the bags are not provided [17]. As WVAED can learn the essential variability between normal and abnormal, its outputs are fundamentally more reliable than those of unsupervised VAED [18]. However, in WVAED, abnormal-labeled frames of the positive bag tend to be influenced by normal-labeled frames in the negative bag, since the abnormality does not necessarily stand out against the normality. Consequently, it sometimes becomes difficult to detect anomalous snippets. Many researchers (e.g., [8,9,10,19,20]) have made efforts to take this problem forward using MIL frameworks. Many of the existing approaches encode the extracted visual content by applying a backbone (e.g., C3D [21], I3D [22]) pretrained on action recognition tasks. However, VAED depends on discriminative representations that clearly represent the events in a scene. Thus, these existing backbones are not well suited for VAED, due to the domain gap [1]. To address this limitation, and inspired by the success of recent vision-language works [23,24,25], which proved the potency of feature representations learned via contrastive language-image pretraining (CLIP) [26], Joo et al. [20] employed the vision transformer (ViT)-encoded visual features from CLIP [26]. Nevertheless, the performance of MIL-based WVAED methods heavily depends on the pretrained feature extractors.

In this paper, we first propose utilizing pretrained feature extractors with backbones of both CNN (e.g., C3D [21], I3D [22]) and ViT (e.g., CLIP [26]) types to extract discriminative representations effectively. We propose a temporal self-attention network (TSAN) to generate reweighed attention features by modeling the continuity between snippets of a video and selecting the top-k most relevant snippets. The reweighed attention features are then used to produce anomaly scores with a multi-layer perceptron (MLP)-based score allocator. In the TSAN pipeline, we utilize the statistically most significant features as probabilities by employing a temporal scoring technique based on Mahalanobis distances instead of the mean feature magnitudes of snippets. The motivations for using the Mahalanobis metric over the mean are as follows: (i) it corrects for the correlations between different features; (ii) it automatically accounts for the scaling of the coordinate axes; (iii) it can provide curved as well as linear decision boundaries. Our ablation study showed that a maximum mean performance gain of 5.34% can be achieved empirically by employing the Mahalanobis metric. In addition, the TSAN deals with an arbitrary number of abnormal snippets in an abnormal video: the top-k selector in the TSAN addresses the k snippets of interest in the video. We model long-range and short-range temporal dependencies and put forward the snippets of interest by leveraging the TSAN. In brief, we design an MIL-based generalized architecture, CNN-ViT-TSAN, as portrayed in Figure 1, to specialize five different models, namely C3D-TSAN, I3D-TSAN, CLIP-TSAN, C3D-CLIP-TSAN, and I3D-CLIP-TSAN, for WVAED problems.
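A rough organizational sketch of this generalized pipeline is given below, assuming pre-extracted CNN and/or ViT snippet features. All names and dimensions are illustrative assumptions, and the placeholder scoring inside TSANLite deliberately simplifies the Mahalanobis-based scoring of Section 3.2.1; it is not the authors' released code.

```python
import torch
import torch.nn as nn

class TSANLite(nn.Module):
    """Illustrative stand-in for the TSAN: scores snippets, keeps the top-k,
    and reweighs the snippet features accordingly."""
    def __init__(self, k=3):
        super().__init__()
        self.k = k

    def forward(self, feats):                            # feats: (T, D)
        scores = torch.softmax(feats.norm(dim=1), 0)     # placeholder snippet scores
        topk = torch.topk(scores, min(self.k, feats.shape[0])).indices
        mask = torch.zeros_like(scores).scatter_(0, topk, 1.0)
        return feats * (scores * mask).unsqueeze(1)      # reweighed attention features

class CNNViTTSAN(nn.Module):
    """Generalized model: CNN and/or ViT features -> TSAN -> MLP score per snippet."""
    def __init__(self, dim=512, use_cnn=True, use_vit=True):
        super().__init__()
        self.use_cnn, self.use_vit = use_cnn, use_vit
        self.tsan = TSANLite()
        self.scorer = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(),
                                    nn.Linear(512, 256), nn.ReLU(),
                                    nn.Linear(256, 1), nn.Sigmoid())

    def forward(self, cnn_feats=None, vit_feats=None):
        fused = 0
        if self.use_cnn:
            fused = fused + self.tsan(cnn_feats)
        if self.use_vit:
            fused = fused + self.tsan(vit_feats)         # element-wise information fusion
        return self.scorer(fused).squeeze(-1)            # anomaly score per snippet in [0, 1]
```

Passing only cnn_feats, only vit_feats, or both corresponds loosely to the C3D/I3D-TSAN, CLIP-TSAN, and C3D/I3D-CLIP-TSAN variants, respectively.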
Each model consists of three main modules responsible for (i) feature encoding by the CNN and/or ViT; (ii) modeling snippet consistency in the temporal dimension using the TSAN; and (iii) identifying abnormal snippets in connection with the separation maximization supervisor (SMS), where the SMS trains the abnormal snippets to have a high value and the normal snippets to have a low value. The C3D-TSAN and I3D-TSAN models do not require ViT-based feature extraction, while the CLIP-TSAN model does not need CNN-based feature extraction. Information fusion takes place in the TSAN for the C3D-CLIP-TSAN and I3D-CLIP-TSAN models only, whereas C3D-TSAN, I3D-TSAN, and CLIP-TSAN skip it. Each of our proposed models is based on a distinct degree of feature extraction and usability capabilities required for crowd video anomaly detection. Consequently, in experimental setups considering the UMN, UCSD-Ped1, UCSD-Ped2, ShanghaiTech, and UCF-Crime datasets, some of these models demonstrated inferior results, while others showed superior results. For example, the I3D-CLIP-TSAN model demonstrated the best results and outperformed its alternatives by extracting and using high-quality features from the available videos, as well as confirming better normal/abnormal snippet separability.

The unique contributions and advancements that our proposed CNN-ViT-TSAN framework brings to the field of WVAED problems are recapitulated as follows:
- We propose five deep models for WVAED problems by designing an MIL-based generalized framework, CNN-ViT-TSAN. The information fusion between CNN and ViT is a unique contribution;
- We propose a TSAN that helps to provide anomaly scores for video snippets in WVAED problems;
- We uniquely introduce the usage of the Mahalanobis metric for calculating probability scores in the TSAN;
- Experiments on several benchmark datasets demonstrated the superiority of our models compared with state-of-the-art approaches.

The rest of this paper is organized as follows: Section 2 addresses the most relevant previous studies. Section 3 discusses our proposed generalized framework. Section 4 explains the experimental setup, the results obtained on public datasets, a comparison, reasons for superiority, best network analysis, an ablation study, and the limitations of our models. Section 5 concludes the paper with a few clues for further study.

2. Related Work

Methods of WVAED are based on video-level labels and typically follow the MIL ranking framework [8]. Based on MIL, a WVAED method trains a regression model to assign scores to video snippets, assuming that the maximum score of the positive bag is higher than that of the negative bag. The existing WVAED methods can be roughly categorized into two broad kinds on the basis of the pretrained models used, namely CNN-based and ViT-based WVAED methods, as summarized below.

2.1. CNN-Based WVAED Methods

Sultani et al. [8], Tian et al. [19], Zhang et al. [9], Zhong et al. [6], and Zhu et al. [11] employed CNN-based pretrained models in their experimental setups. Sultani et al. [8] also pre-collected annotated normal and abnormal video events at video level to build their popular UCF-Crime dataset and applied it with their weakly supervised framework for detecting anomalies. In their framework, after extracting C3D features [27] for video segments, they trained a fully connected neural network by applying a ranking loss function, which computed the ranking loss between the highest-scored instances in the positive bag and the negative bag.
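Stripped of the temporal smoothness and sparsity terms of [8], this ranking objective can be sketched as a hinge loss between the highest-scored instances of the two bags (an illustrative approximation, not the original implementation):

```python
import torch

def mil_ranking_loss(scores_pos, scores_neg, margin=1.0):
    """scores_pos / scores_neg: (T,) anomaly scores for the snippets of one
    anomalous (positive) and one normal (negative) video, respectively."""
    # Hinge loss on the maximum-scored instance of each bag.
    return torch.relu(margin - scores_pos.max() + scores_neg.max())

# Example: the top snippet of the positive bag should outrank the negative bag.
loss = mil_ranking_loss(torch.tensor([0.1, 0.9, 0.2]), torch.tensor([0.2, 0.3, 0.1]))
```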
Tian et al. [19] treated C3D [27] and I3D [22] as feature extractors for their WVAED model. They claimed that selecting the top-3 features based on their magnitude can introduce a greater separation between normal and anomalous videos, since, if more than one abnormal snippet exists per anomalous video, the mean snippet feature magnitude of anomalous videos is larger than that of normal videos. Zhang et al. [9] trained a temporal convolution network between the preceding adjacent segment and the current segment for extracting positive and negative video segment C3D features [27]. Afterwards, they trained two branches of a fully connected neural network using an inner and outer bag ranking loss, considering the highest- and lowest-scored segments in the positive and negative bags. Zhong et al. [6] and Zhu et al. [11] trained a feature encoder and classifier together. Zhong et al. [6] addressed WVAED as a supervised learning task under noisy labels. To verify the widespread applicability of their model, they carried out extensive experiments considering a C3D [27] and a temporal segment network [28]. Zhu et al. [11] included the temporal context in their MIL ranking model by applying an attention block. They claimed that features containing motion information extracted by C3D [27] and I3D [22] performed better than features extracted from separate images using VGG16 [29] and Inception [30], regardless of the network depth and feature dimension.

2.2. ViT-Based WVAED Methods

ViT-based pretrained models can be categorized into single-stream and dual-stream types. A single-stream model applies a single transformer to model both image (or video) and text representations in a combined framework, whereas a dual-stream model independently encodes image (or video) and text with a decoupled encoder. Examples of ViT feature extractors include VisualBERT [31], ViLBERT [32], CLIP [26], and data-efficient CLIP [33]. Recently, Joo et al. [20] proposed a CLIP-assisted [26] temporal self-attention framework for the WVAED problem. They conducted experiments on publicly available datasets to verify their end-to-end WVAED framework. Li et al. [34] suggested a transformer-based multi-instance learning network to learn video-level anomaly probabilities and snippet-level anomaly scores. In the inference stage, they employed the video-level anomaly probability to suppress the fluctuation of the snippet-level anomaly scores. Lv et al. [35] presented an unbiased MIL scheme that learned an unbiased anomaly classifier and a tailored representation for WVAED.

In view of the existing solutions, we found that a CNN and a ViT are generally employed separately. To take advantage of both CNN- and ViT-based pretrained models, we designed an MIL-supported generalized architecture named CNN-ViT-TSAN to specify a series of models for the WVAED problem.

3. Proposed Generalized Framework

Our generalized framework follows the MIL model, in which the positive bag represents an anomaly and the negative bag denotes normality. Its constituent components are discussed in the following subsections.

3.1. Feature Extraction

Videos in the training set are labeled only at video level in WVAED. Assume that a set of weakly labeled training videos $\mathcal{W} = \{V_v, y_v\}_{v=1}^{|\mathcal{W}|}$ is available, where each video $V_v = \{\mathrm{Frame}_i\}_{i=1}^{N_v} \in \mathbb{R}^{N_v \times W \times H}$ denotes a sequence of $N_v$ frames of width $W$ pixels and height $H$ pixels.
Here, $y_v \in \{0, 1\}$ indicates the video-level label of video $V_v$ with respect to anomaly, i.e., it is 1 for an anomaly video that holds at least one abnormal event, and 0 otherwise. For a video $V_v = \{\mathrm{Frame}_i\}_{i=1}^{N_v}$, we divide it into a set of $\{\gamma_i\}_{i=1}^{\lfloor N_v/\Delta \rfloor}$ non-overlapping temporal snippets, each with a length of $\Delta$ frames.

3.1.1. Feature Extraction Using a Pretrained CNN

Convolutional neural networks (CNNs), as one of the most representative deep learning models, exhibit great potential in the field of image classification. The CNN-based C3D (Convolutional 3D) [21] and I3D [22] are two common feature extractors. As a feature extractor, C3D is generic, compact, simple, and efficient. Tran et al. [27] showed that C3D can model appearance and motion information simultaneously and outperformed 2D CNN features in various video-analysis tasks. Carreira et al. [22] introduced a two-stream (i.e., RGB and flow) Inflated 3D CNN (I3D). Ideally, feature extraction can be performed efficiently by either C3D or I3D. We considered the C3D feature of Ji et al. [21] and the I3D feature of Carreira and Zisserman [22]. We computed features of $T$ snippets with feature dimension $\aleph'$ using C3D and I3D separately. Let $\Phi_v^{cnn\prime} = \{\phi_i\}_{i=1}^{T_v} \in \mathbb{R}^{T_v \times \aleph'}$ be the extracted features of $V_v$, where $T_v$ is the number of snippets of $V_v$.

As a dimensionality reduction technique, principal component analysis (PCA) works under the assumption that the data follow a normal distribution. For this reason, it can be very sensitive to the variance of the variables. In addition, as the extracted data are not normalized, the reduced dimensions obtained using PCA or similar techniques would give erroneous results. The low-variance filter, on the other hand, is an advantageous dimensionality reduction algorithm often used in machine learning on numerical data. Instead of using PCA, we therefore apply the low-variance-filter algorithm to reduce the dimensionality of the extracted data. Upon dimensionality reduction, $\Phi_v^{cnn\prime} \in \mathbb{R}^{T_v \times \aleph'}$ takes the shape of $\Phi_v^{cnn} \in \mathbb{R}^{T \times \aleph}$, i.e., an $\aleph$-dimensional feature for each of the $T$ snippets.

3.1.2. Feature Extraction Using a Pretrained ViT

Vision-language pretrained models extract the relationships between objects/actions in a video and objects/actions in text using vision transformers (ViTs). Depending on the application, various kinds of ViTs exist, e.g., VisualBERT [31], ViLBERT [32], CLIP [26], and data-efficient CLIP [33]. In general, CLIP [26] is a multi-modal vision and language model, which utilizes a ViT as a backbone for visual features. Instead of considering all frames in a snippet $\gamma_j$, we assume that the middle frame $d_j = \lceil \Delta/2 \rceil$ represents the snippet $\gamma_j$. Following Joo et al. [20], we apply CLIP [26] to the frame $d_j$ of snippet $\gamma_j$ to represent its feature as $\phi_j \in \mathbb{R}^{\aleph}$ with feature dimension $\aleph$; $V_v$ can then be constituted as a set of video feature vectors $\Phi_v^{vit} = \{\phi_j\}_{j=1}^{T_v} \in \mathbb{R}^{T \times \aleph}$.
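A hedged sketch of the feature preparation described above, with the snippet length, the low-variance threshold, and the array shapes treated as placeholder assumptions:

```python
import numpy as np

def split_into_snippets(frames, delta=16):
    """frames: (N, H, W, 3). Returns non-overlapping delta-frame snippets;
    a trailing remainder shorter than delta is dropped, as in the paper."""
    n = len(frames) // delta
    return [frames[i * delta:(i + 1) * delta] for i in range(n)]

def low_variance_filter(feats, threshold=0.00875):
    """feats: (T, D') extracted C3D/I3D features. Keeps only the feature
    dimensions whose variance across snippets exceeds the threshold."""
    keep = feats.var(axis=0) > threshold
    return feats[:, keep]

def middle_frame(snippet):
    """The middle frame stands in for the whole snippet for CLIP encoding."""
    return snippet[len(snippet) // 2]
```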
3.2. Temporal Self-Attention Network (TSAN)

Figure 1 visualizes our proposed TSAN mechanism, which models the snippet coherency and selects the top-k most significant snippets. It maximizes the attention on a subset of features, while it minimizes the attention on noise. The pipeline of the TSAN consists of four components, namely: (i) a temporal scoring module, (ii) a top-k selecting module, (iii) a multiplying-averaging module, and (iv) an information fusion module.

3.2.1. Temporal Scoring Module

The temporal scoring technique utilizes the statistically most significant features as probabilities, considering Mahalanobis distances instead of the mean feature magnitudes of the snippets. The mathematical exposition is given in Algorithm 1. The scores $P_{score} \in \mathbb{R}^{T \times 1}$ are employed to estimate anomaly attention features, upon extracting the k most significant snippets from the video using Algorithm 2. Concisely, each of $\Phi_v^{cnn} \in \mathbb{R}^{T \times \aleph}$ and $\Phi_v^{vit} \in \mathbb{R}^{T \times \aleph}$ can be converted into a probability score vector $P_{score} \in \mathbb{R}^{T \times 1}$ using Algorithm 1, where each score represents a snippet. The scores $P_{score}$ are fed to the top-k selecting module for further processing. The CLIP-TSAN model does not expect $\Phi_v^{cnn}$ to be processed by Algorithm 1 to obtain $P_{score}$; in this case, the final output $\overline{\Phi_v^{cnn}} \in \mathbb{R}^{T \times \aleph}$ of the multiplying-averaging module has no active function in the information fusion module, and solely $\Phi_v^{vit}$ is processed by Algorithm 1 to obtain the $P_{score}$ fed to the top-k selecting module. Conversely, the C3D-TSAN model does not require $\Phi_v^{vit}$ to be processed by Algorithm 1; in this instance, the final output $\overline{\Phi_v^{vit}} \in \mathbb{R}^{T \times \aleph}$ of the multiplying-averaging module has no operational function in the information fusion module, and only $\Phi_v^{cnn}$ is processed by Algorithm 1 to obtain $P_{score}$. Likewise, the I3D-TSAN model does not expect the scores $P_{score}$ obtained from $\Phi_v^{vit}$. However, the C3D-CLIP-TSAN and I3D-CLIP-TSAN models need the scores $P_{score}$ obtained from both $\Phi_v^{cnn}$ and $\Phi_v^{vit}$; they use Algorithm 1 to obtain $P_{score}$ in a sequential manner, as in CLIP-TSAN, C3D-TSAN, and/or I3D-TSAN. In the case of either C3D-CLIP-TSAN or I3D-CLIP-TSAN, the final outputs $\overline{\Phi_v^{cnn}}$ and $\overline{\Phi_v^{vit}}$ from the multiplying-averaging module are stored in the information fusion module for element-wise addition.

Algorithm 1: Calculation of the probability scores $P_{score}$ considering Mahalanobis distances.
Algorithm 2: Processing of the probability scores $P_{score}$ in the TSAN.
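The two algorithm listings appear as figures in the original article and are not reproduced here. As a rough approximation of what Algorithm 1 computes, each snippet can be scored by its Mahalanobis distance to the video's own feature distribution and the distances normalized into probability-like scores (a sketch under simplifying assumptions, e.g., a single regularized covariance estimated per video):

```python
import numpy as np

def mahalanobis_scores(feats, eps=1e-6):
    """feats: (T, D) snippet features of one video.
    Returns (T,) probability-like scores derived from Mahalanobis distances."""
    mu = feats.mean(axis=0)
    cov = np.cov(feats, rowvar=False) + eps * np.eye(feats.shape[1])
    inv_cov = np.linalg.inv(cov)
    diff = feats - mu
    d = np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))  # distance per snippet
    return d / d.sum()                                          # normalize into scores
```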
3.2.2. Top-k Selecting Module

This module extracts the k most significant snippets from the video using Algorithm 2.

The thresholds of the low-variance filter were 0.00723, 0.00835, 0.00875, and 0.00911 for the UMN, Peds, ShanghaiTech, and UCF-Crime datasets, respectively. The three-layered MLP of 512, 256, and 1 units had its hidden layers followed by a ReLU activation function and its final layer followed by a sigmoid function, to produce a value between 0 and 1. Our model was trained in an end-to-end manner and implemented using PyTorch [41]. We used the Adam optimizer [42] with a weight decay of 0.0005 and a batch size of 32 for 50 epochs. The learning rate was set to 0.001 for all datasets. We employed an Intel Core i7-7800X CPU @ 3.50 GHz, along with an NVIDIA GeForce GTX 1080 graphics card, for both training and evaluation of the model. We also adopted OpenAI, Google Colab, and Google Drive for feature extraction.

We used the area under the receiver operating characteristic (ROC) curve ($AUC$) to evaluate the overall model performance. The $AUC$, with $0 \le AUC \le 1$, is one of the most frequently used metrics for evaluating various flows and events in crowd videos [8,43,44]. The predictions of a model are 100% wrong or 100% correct if $AUC = 0$ or $AUC = 1$, respectively. Intuitively, a larger $AUC$ implies a larger margin between the normal and abnormal snippet predictions, thus resulting in a better anomaly classifier. The sensitivity, recall, hit rate, and true positive rate ($TPR$) can be formulated, as in Equation (6), as $TPR = \frac{tp}{tp + fn}$, where $tp$ and $fn$ denote the numbers of true positive and false negative frames, respectively. The fall-out or false positive rate ($FPR$) can be formulated, as in Equation (7), as $FPR = 1 - \frac{tn}{tn + fp} = \frac{fp}{fp + tn}$, where $fp$ and $tn$ indicate the numbers of false positive and true negative frames, respectively. The ROC curve is a two-dimensional graphical visualization, in which the $FPR$ is plotted on the X-axis and the $TPR$ is plotted on the Y-axis (e.g., the right-side subgraphs of Figure 2). The values of $AUC$ are calculated as the areas below the ROC curves (e.g., the yellow-colored regions of Figure 2). Mathematically, the value of $AUC$ can be calculated using the trapezoidal numerical integration method [45].
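The frame-level evaluation protocol can be reproduced in a few lines of NumPy; the sketch below sweeps score thresholds, computes the TPR and FPR of Equations (6) and (7), and integrates the ROC curve with the trapezoidal rule (scikit-learn's roc_auc_score would give the same result):

```python
import numpy as np

def frame_level_auc(y_true, y_score):
    """y_true: (N,) binary frame labels; y_score: (N,) anomaly scores."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    thresholds = np.unique(y_score)[::-1]          # high to low
    pos, neg = (y_true == 1).sum(), (y_true == 0).sum()
    tpr, fpr = [0.0], [0.0]
    for t in thresholds:
        pred = y_score >= t
        tpr.append((pred & (y_true == 1)).sum() / pos)   # TPR = tp / (tp + fn)
        fpr.append((pred & (y_true == 0)).sum() / neg)   # FPR = fp / (fp + tn)
    return np.trapz(tpr, fpr)                      # trapezoidal area under the ROC curve
```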
4.3. Results on Various Datasets

As real-world abnormal events are miscellaneous and hard to predict, to demonstrate the applicability of our generalized framework to multiple environments, we ran experiments on frequently used VAED evaluation datasets, e.g., UMN, UCSD-Ped1, UCSD-Ped2, ShanghaiTech, and UCF-Crime. Figure 2 visualizes sample testing results of I3D-CLIP-TSAN (ours) with videos from the UMN, UCSD-Ped1, UCSD-Ped2, ShanghaiTech, and UCF-Crime datasets, including abnormal events with sudden running of people, vehicles passing between bidirectional flows of people, bicycle riders in a pedestrian zone, bicycles crossing, and the action of taking something from a person forcefully as well as unlawfully, respectively. The obtained frame-level AUC scores of the sample testing videos in Figure 2 were 0.991, 0.943, 0.986, 0.989, and 0.912, consecutively. Although UMN, UCSD-Ped1, and UCSD-Ped2 are popular benchmarks for video anomaly detection, they are small in terms of the number of videos and their duration. Variations in the anomalies are also very narrow. Furthermore, some abnormalities are not practical, or sometimes the spatial annotation is not very clear. For these reasons, few authors have conducted experiments with these datasets explicitly. However, we considered all these datasets to show the generalizability of our models. From Figure 2, it is noticeable that I3D-CLIP-TSAN (ours) was suitable for detecting various anomaly events, ranging from simple datasets (e.g., UMN, UCSD-Ped1, and UCSD-Ped2) to large-scale datasets (e.g., ShanghaiTech and UCF-Crime).

4.4. Performance Comparison

Assume that $AUC_o$ denotes the $AUC$ computed on all the testing videos in a dataset. Table 1 compares the frame-level $AUC_o$ performance scores of our models for the UCSD-Ped2, ShanghaiTech, and UCF-Crime datasets, along with state-of-the-art methods. It seems that our proposed models could be generalized for detecting various abnormal events from those datasets. In general, both ShanghaiTech and UCF-Crime would be called wide-scale anomaly detection datasets. All authors in Table 1 considered the ShanghaiTech and UCF-Crime datasets for conducting their experiments.

The reported results in Table 1 indicate that the improvements in performance by our proposed methods on the ShanghaiTech and UCF-Crime datasets were more remarkable than those for the UCSD-Ped2 dataset. However, for a coherent and intelligible comparison of the performance of the various methods, we performed a non-parametric statistical investigation based on the results presented in Table 1, considering two categories: the first category consisted of the ShanghaiTech and UCF-Crime datasets only, while the second category considered the UCSD-Ped2, ShanghaiTech, and UCF-Crime datasets.

Figure 3 depicts the Nemenyi [64] post hoc critical distance diagram at a level of significance of $\alpha = 0.05$, considering the $1 - AUC_o$ scores in Table 1 for the first category with the existing models of Sultani et al. (2018) [8], Zhong et al. (2019) [6], Zhang et al. (2019) [9], Zaheer et al. (2020) [46], Zaheer et al. (2020) [7], Wan et al. (2020) [47], Purwanto et al. (2021) [13], Tian et al. (2021) [19], Majhi et al. (2021) [48], Wu et al. (2021) [49], Yu et al. (2021) [50], Lv et al. (2021) [12], Feng et al. (2021) [51], Zaheer et al. (2022) [3], Zaheer et al. (2022) [52], Joo et al. (2022) [20], Cao et al. (2022) [53], Li et al. (2022) [34], Cao et al. (2022) [54], Tan et al. (2022) [55], Li et al. (2022) [34], Yi et al. (2022) [56], Yu et al. (2022) [57], Gong et al. (2022) [58], Majhi et al. (2023) [59], Park et al. (2023) [60], Pu et al. (2023) [61], Lv et al. (2023) [35], Sun et al. (2023) [62], and Wang et al. (2023) [63]. If the distance between two models is less than the Nemenyi [64] post hoc critical distance at a certain p-value (e.g., 0.05), there is no statistically significant difference between them. Explicitly, two models are considered significantly different if their performance variation is greater than the Nemenyi [64] post hoc critical distance. To this end, from Figure 3, it is noticeable that at $\alpha = 0.05$, none of the model pairs differ significantly, as the heavy red line of length 51.7871 (which is the Nemenyi [64] post hoc critical distance) is longer than the heavy pink line. For example, the distance between I3D-CLIP-TSAN (ours) and Sultani et al. 2018 (C3D) [8] is $|44 - 1| = 43$ (heavy pink line), which is less than 51.7871 at $\alpha = 0.05$ (i.e., the 95% confidence limit). In other words, their distance difference fell short by a numerical value of $|51.7871 - 43| = 8.7871$. Consequently, the difference between I3D-CLIP-TSAN (ours) and Sultani et al. 2018 (C3D) [8] was not statistically significant. Similarly, the difference between Joo et al. 2022 (CLIP) [20] and Sultani et al. 2018 (C3D) [8] was not statistically significant, as their distance difference fell short by a numerical value of $|51.7871 + 3 - 44| = 10.7871$. However, the model I3D-CLIP-TSAN (ours) was $1 - 8.7871/10.7871 = 18.54\%$ closer to statistical significance than Joo et al. 2022 (CLIP) [20]. This implies that I3D-CLIP-TSAN (ours) was slightly better generalized for divergent anomaly event detection in videos from the ShanghaiTech and UCF-Crime datasets than any other model in Table 1.
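For reference, the Nemenyi critical distance is conventionally computed as $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ from the average ranks of $k$ models over $N$ datasets; the generic sketch below follows that formulation and does not attempt to reproduce the article's reported value of 51.7871, which depends on its own scaling of the $1 - AUC_o$ scores:

```python
import math

# Excerpt of standard q_alpha values (alpha = 0.05) for the Nemenyi test,
# indexed by the number of compared models k.
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 10: 3.164}

def nemenyi_cd(k, n_datasets, q_alpha):
    """Two models differ significantly if their average ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))

# e.g., 10 models compared over 5 datasets at alpha = 0.05:
cd = nemenyi_cd(10, 5, Q_ALPHA_005[10])
```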
Figure 4 shows a Nemenyi [64] post hoc critical distance diagram at the level of significance $\alpha = 0.10$, considering the $1 - AUC_o$ scores in Table 1 for the second category with the existing models of Zhong et al. (2019) [6], Zaheer et al. (2020) [46], Tian et al. (2021) [19], and Zaheer et al. (2022) [3]. Few models fell into this category, because many authors avoided the UCSD-Ped2 dataset. However, from Figure 4, it is noticeable that the difference between I3D-CLIP-TSAN (ours) and Zaheer et al. 2022 (C3D) is statistically significant, as their distance difference (i.e., $|9.6667 - 1.3333| = 8.3334$) was greater than 7.2184 at the 90% confidence limit. Similarly, the differences between I3D-CLIP-TSAN (ours) and Zhong et al. 2019 (TSN), and between C3D-CLIP-TSAN (ours) and Zaheer et al. 2022 (C3D), were statistically significant. The other differences in this category were not statistically significant, as their distance differences were less than 7.2184.

In summary, some of our proposed methods demonstrated their superiority over many existing state-of-the-art methods, as indicated in Table 1. Notably, the aforementioned statistical analysis shows that the method I3D-CLIP-TSAN (ours) took the top place in the rankings of each category. This implies that I3D-CLIP-TSAN (ours) has the ability to utilize good features from the pretrained CNN-ViT feature extractors, considering the available videos, and confirmed a high separability between the normal and abnormal snippets for VAED.

4.5. Reasons for Superiority

4.5.1. Advantage of Information Fusion

In the TSAN, both the CNN- and ViT-related processing can produce their own reweighed attention features (e.g., $\Phi_v^{cnn} \in \mathbb{R}^{T \times \aleph}$ and $\Phi_v^{vit} \in \mathbb{R}^{T \times \aleph}$), which can be directly used by the C3D-TSAN, I3D-TSAN, and CLIP-TSAN models, as these features can individually provide necessary (but possibly not sufficient) information for producing the anomaly scores used for anomaly detection. However, the information fusion of these two dissimilar backbones (e.g., $\Phi_v^{fusion} = \Phi_v^{cnn} + \Phi_v^{vit} \in \mathbb{R}^{T \times \aleph}$) can augment the quality of the feature representation. Both the C3D-CLIP-TSAN and I3D-CLIP-TSAN models applied $\Phi_v^{fusion}$ and achieved superior performance compared to the other models. For example, from Table 1, on the UCF-Crime dataset, the model I3D-CLIP-TSAN (ours) achieved $0.8897/0.8650 - 1 \approx 3\%$ and $0.8897/0.8763 - 1 \approx 2\%$ better performance with respect to I3D-TSAN (ours) and CLIP-TSAN (ours), respectively. Clearly, the performance gains of 3% and 2% for I3D-CLIP-TSAN (ours) were the contribution of the information fusion in the TSAN.

4.5.2. Better Information Gains with the Mahalanobis Metric

Tian et al. [19] assumed that the mean feature magnitude of abnormal snippets is larger than that of the normal snippets. We, however, applied the measure of Mahalanobis distances, which is much larger and more accurate than the mean feature magnitudes. We provide a simple example using the UMN dataset [38]. Usually, any video from the UMN dataset [38] starts with a normal event and ends with an abnormal event. Assume that we obtained the spatiotemporal information of each frame $f$ (where $f \in \{1, 2, \dots, 900\}$) in a video (e.g., the third video) from the UMN dataset [38] using an existing optical-flow method. For any $f$, irrespective of normal or abnormal events, we consider the spatiotemporal information of five features observed over time, arranged in a matrix $M \in \mathbb{R}^{n \times 5}$ as follows:

$$M(u)(v) = \begin{bmatrix} x(1)(1) & x(1)(2) & x(1)(3) & x(1)(4) & x(1)(5) \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ x(i)(1) & x(i)(2) & x(i)(3) & x(i)(4) & x(i)(5) \\ \vdots & \vdots & \vdots & \vdots & \vdots \\ x(n)(1) & x(n)(2) & x(n)(3) & x(n)(4) & x(n)(5) \end{bmatrix},$$

where $u \in \{1, 2, \dots, n\}$; $i \in u$; $v \in \{1, 2, 3, 4, 5\}$; $x(i)(1) \mapsto$ x-coordinate of $i$; $x(i)(2) \mapsto$ y-coordinate of $i$; $x(i)(3) \mapsto$ x-velocity of $i$; $x(i)(4) \mapsto$ y-velocity of $i$; and $x(i)(5) \mapsto$ resulting motion direction of $i$.

We calculate the sum of the mean feature magnitudes of $f$, denoted $S_{mean}(f)$, and the sum of Mahalanobis distances (considering Algorithm 1), denoted $S_{Mahal}(f)$, using Equations (9) and (10), respectively:

$$S_{mean}(f) = \sum_{j=1}^{5} \frac{1}{n} \sum_{i=1}^{n} M(i)(j), \qquad S_{Mahal}(f) = \sum_{i=1}^{n} \mathrm{MahalDist}(i).$$

Figure 5 shows a numerical comparison of the sum of mean feature magnitudes and the sum of Mahalanobis distances for a video from the UMN dataset [38]. It is noticeable that the normal and abnormal frames cannot be separated using mean feature magnitudes, whereas the Mahalanobis distances can separate them to some extent. Thus, the Mahalanobis distance is more faithful to the ground truth than the mean feature magnitudes. We estimated the probabilities of $S_{mean}(f)$ and $S_{Mahal}(f)$ using Equations (11) and (12), respectively, as

$$P_{mean}(f) = 4\,e^{-S_{mean}(f)/65}, \qquad P_{Mahal}(f) = 4\,e^{-S_{Mahal}(f)/65}.$$

In machine learning, the information gain is defined as the amount of information gained about a random variable or signal from observing another random variable. For such a measure, the Kullback–Leibler divergence $D_{\mathrm{KL}}(P_{Mahal}(:) \,\|\, P_{mean}(:))$ [65] can be applied, where the distributions $P_{Mahal}(:)$ and $P_{mean}(:)$ comprise the probability values of the 900 frames. Equation (13) can be regarded as the information gain achieved if $P_{Mahal}(:)$ is employed as an alternative to $P_{mean}(:)$. If $P_{Mahal}(:)$ and $P_{mean}(:)$ match perfectly, then $D_{\mathrm{KL}}(P_{Mahal}(:) \,\|\, P_{mean}(:)) = 0$; otherwise it can take values between 0 and $\infty$:

$$D_{\mathrm{KL}}(P_{Mahal}(:) \,\|\, P_{mean}(:)) = \sum_{f=1}^{900} \left[ P_{Mahal}(f) \log \frac{P_{Mahal}(f)}{P_{mean}(f)} - P_{Mahal}(f) + P_{mean}(f) \right].$$

The calculated score of 118.41 in Equation (13) quantifies how much the probability distribution $P_{Mahal}(:)$ differs from the $P_{mean}(:)$ distribution on identical grounds. Explicitly, the information gain achieved by $P_{Mahal}(:)$ with respect to $P_{mean}(:)$ was about 118. To agree with the ground truth, the sum of the mean feature magnitudes for an abnormal event should be either greater or smaller than that of a normal event, but Figure 5a does not reflect this. On the other hand, the sum of Mahalanobis distances for an abnormal event should likewise be either greater or smaller than that of a normal event, and Figure 5b does reflect this. As Figure 5 shows that the Mahalanobis distance is closer to the ground truth, the Mahalanobis distance is a more accurate measure than the mean feature magnitudes. The practical results for different datasets on identical grounds also reflected this proposition.
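The comparison in Equations (9)-(13) can be sketched as follows, assuming the per-frame optical-flow feature matrix M is given; the functions below mirror the definitions above rather than any released code:

```python
import numpy as np

def s_mean(M):
    """M: (n, 5) per-object features of one frame. Equation (9)."""
    return M.mean(axis=0).sum()

def s_mahal(M, eps=1e-6):
    """Sum of Mahalanobis distances of the n observations. Equation (10)."""
    mu = M.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(M, rowvar=False) + eps * np.eye(M.shape[1]))
    diff = M - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff)).sum()

def generalized_kl(p, q):
    """Equation (13): sum over frames of p*log(p/q) - p + q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q) - p + q)

# Per-frame probabilities as in Equations (11) and (12):
# p_mahal = 4 * np.exp(-s_mahal(M_f) / 65); p_mean = 4 * np.exp(-s_mean(M_f) / 65)
```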
4.6. Analysis of the Best Network

From the input videos, the spatial features of the independent frames convey information about the depicted scenes and objects, whereas the temporal features of the frame sequences deal with the motion and movement of those objects. A 2D-CNN can learn various spatial features (e.g., edges, corners, and textures) by convolving the input frame with a number of filters. The 2D-CNN is highly effective in extracting spatial features from individual frames of a video, but it is not well suited for capturing temporal information. To accurately capture the temporal dynamics of objects in a video, a different type of neural network must be utilized. A long short-term memory (LSTM) network is a better choice for capturing temporal information. An LSTM network is a deep learning architecture based on an artificial recurrent neural network (RNN). It was specifically designed to handle sequential data, including videos, when modeling the short-range and long-range relationships of sequence features [66]. It also resolves the gradient vanishing problem of the RNN. It is usually used for time series prediction [67]. To apply an LSTM network for temporal feature extraction, the output of the 2D-CNN spatial feature extractor can be fed to the LSTM network as input [66]. This can be performed by utilizing the output of the last fully connected layer of the 2D-CNN as the input to the LSTM. In this fashion, the LSTM network can utilize the spatial information extracted by the CNN, together with its capacity to recall past inputs, to make predictions regarding the temporal relationships in the video.

Both RNNs and LSTMs are laborious to train because they need memory-bandwidth-bound computation, which is demanding for hardware designers and eventually limits the applicability of neural network solutions. By combining a 2D-CNN and an LSTM, it is possible to extract both spatial and temporal features from a video. One of the reasons why researchers prefer 2D-CNNs over LSTMs is the amount of training time required. The contemporary generation of well-known deep learning hardware mostly uses Nvidia graphics cards, which are optimized for processing 2D data with the greatest possible parallelism and speed, which the 2D-CNN exploits. Furthermore, one of the main disadvantages of the LSTM is its difficulty in handling temporal dependencies that are longer than a few steps. For example, when an LSTM was trained on a dataset with long-term dependencies (e.g., 100 steps), the network struggled to learn the task and generalize to new examples [68]. Moreover, when data are scarce or noisy, an LSTM tends to overfit the training data and suffers a loss of generalization ability [69]. As a result, using an LSTM for extracting temporal features is discouraged. A better solution for extracting temporal features is to employ a C3D network. For example, to take advantage of a 2D-CNN architecture, all filters and pooling kernels of 2D-CNN models can be inflated to a 3D-CNN by equipping them with an additional temporal dimension, i.e., $\eta \times \eta$ filters become $\eta \times \eta \times \eta$ filters. Afterwards, the weights of the 2D filters can be repeated $\eta$ times along the temporal dimension, to bootstrap parameters from pretrained 2D-CNN models to the 3D-CNN models [70].
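The filter-inflation idea of [70] can be sketched as repeating each pretrained 2D kernel eta times along a new temporal axis and rescaling it, so that a temporally constant input produces the same response; this is an illustrative approximation, not the official I3D bootstrapping code:

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, eta: int) -> nn.Conv3d:
    """Turn an eta x eta 2D convolution into an eta x eta x eta 3D convolution
    by repeating its pretrained weights along the temporal dimension."""
    conv3d = nn.Conv3d(conv2d.in_channels, conv2d.out_channels,
                       kernel_size=(eta,) + conv2d.kernel_size,
                       padding=(eta // 2,) + conv2d.padding,
                       bias=conv2d.bias is not None)
    with torch.no_grad():
        w2d = conv2d.weight                                   # (out, in, k, k)
        conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, eta, 1, 1) / eta)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# e.g., inflate a pretrained 3x3 convolution into a 3x3x3 one:
c3 = inflate_conv2d(nn.Conv2d(64, 128, kernel_size=3, padding=1), eta=3)
```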
We propose the TSAN, which generates reweighed attention features by measuring the degree of abnormality of snippets. Explicitly, the TSAN mechanism maximizes attention on a subset of features, while minimizing the attention on noise. To a large extent, our exceptional performance comes from the utilization of the TSAN along with the fusion of the I3D features and the rich contextual vision-language features of CLIP.

Most of the existing approaches in Table 1 encode visual content by applying a CNN-based backbone of either C3D or I3D. Like the existing C3D- or I3D-based models in Table 1, our proposed C3D-TSAN and I3D-TSAN models demonstrated performance of a comparable nature. Nevertheless, the I3D-TSAN model showed superior performance to the C3D-TSAN model in identical setups. The C3D was more suitable for spatiotemporal feature learning than the 2D CNN [27]. Fundamentally, the operation of 2D convolution convolves an image with a 2D convolution kernel to extract spatial features from that image, whereas 3D convolution convolves the cube constructed by stacking several successive video frames with a 3D convolution kernel to extract video features in the spatiotemporal dimension. More specifically, the C3D is an excellent model for applying 3D convolution kernels, which are natural for processing signals with spatiotemporal structure, such as videos. Even so, its complicated structure stops it from becoming deeper [71]. The I3D is an improved model based on the C3D. Basically, I3D puts into practice an inflated version of the Inception module architecture [30]. The fundamental features of the Inception module are the combined effects of filters of various sizes and pooling kernels, all in one layer, as well as the use of $1 \times 1$ convolutional filters, which not only assist in lessening the number of parameters but also supply updated combinations of features to the next layers. This explains why the performance of the I3D-TSAN network is better than that of the C3D-TSAN: the improved architecture and more generalized features of the I3D.

On the other hand, the ViT-based CLIP-TSAN model showed the best performance among the three proposed models of C3D-TSAN, I3D-TSAN, and CLIP-TSAN. Both C3D and I3D use the traditional method of convolution, where some channels may carry less useful information while still consuming computational power [72]. Basically, both C3D and I3D were pretrained on action recognition tasks. Unlike the action recognition problem, video anomaly detection depends on discriminative representations that clearly present the events in a scene. Thus, the existing C3D and I3D backbones are not ideal, due to the issue of domain gap [1]. To address this impediment, ViT-based pretrained models (e.g., CLIP, X-CLIP, VideoSwin) were recently leveraged [20,34,35], which proved the effectiveness of their feature representation learning. For example, the ViT-based method of Joo et al. [20] outperformed all existing CNN-based methods in Table 1. Similarly, our proposed CLIP-TSAN model showed almost the same performance as the model of Joo et al. [20]. Our proposed C3D-CLIP-TSAN model demonstrated better performance than CLIP-TSAN, due to the information fusion [4] from the CNN and ViT. Nevertheless, the C3D-CLIP-TSAN model showed slightly inferior performance to I3D-CLIP-TSAN on identical grounds. This was largely because the I3D simply has a better architecture than the C3D [22]. For instance, the I3D operates on two 3D stream inputs, whereas the C3D operates on a single 3D stream input [73].
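To make the 2D-versus-3D convolution contrast of this section concrete, the toy snippet below applies both to a 16-frame clip and shows the resulting shapes (illustrative only):

```python
import torch
import torch.nn as nn

clip = torch.randn(1, 3, 16, 112, 112)   # (batch, channels, frames, height, width)

# 2D convolution sees one frame at a time: spatial features only.
per_frame = nn.Conv2d(3, 64, kernel_size=3, padding=1)(clip[:, :, 0])   # -> (1, 64, 112, 112)

# 3D convolution slides over the stacked frame cube: spatiotemporal features.
spatiotemporal = nn.Conv3d(3, 64, kernel_size=3, padding=1)(clip)       # -> (1, 64, 16, 112, 112)
```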
4.7. Ablation Study

We conducted an ablation study to investigate the effectiveness of the Mahalanobis metric for our generalized CNN-ViT-TSAN framework. We ran the experiments in two cases: (i) with the Mahalanobis metric and (ii) without the Mahalanobis metric but with the mean feature magnitude of snippets, under identical configuration settings. Table 2 reports their performance. From Table 2, it can be observed that maximum performance gains of 5.01%, 5.18%, 4.99%, 5.25%, and 5.56% were obtained for the UMN, UCSD-Ped1, UCSD-Ped2, ShanghaiTech, and UCF-Crime datasets by applying the Mahalanobis metric. In summary, for these datasets, on average, a maximum performance gain of about 5% could be obtained empirically by employing the Mahalanobis metric (i.e., instead of the mean snippet feature magnitude).

4.8. Limitation of Our Model

Our WVAED models utilize feature representations extracted by CNN- and/or ViT-based pretrained feature extractors as input. As a result, the performance of our models partially depends on the pretrained feature extractors, which makes the calculation costly. In the testing phase, if the length of a snippet is $\Delta$ frames, then video clips shorter than $\Delta$ frames can either be discarded or padded with the final label of the video. In this paper, we chose the former, with $\Delta = 16$ frames. Thus, video clips of fewer than 16 frames were ignored, even though they might contain useful information for performance evaluation.

5. Conclusions

We proposed an MIL-based generalized architecture named CNN-ViT-TSAN, which applies CNN- and/or ViT-extracted features and the TSAN to design a series of deep models for the WVAED problem. Our proposed TSAN mechanism minimized the attention on noise and maximized attention on a subset of features. Instead of using the mean feature magnitude, we uniquely introduced the usage of the Mahalanobis distance for the WVAED problem. A performance gain of approximately 5% was empirically recorded by employing the Mahalanobis distance in a setup identical to that using the mean snippet feature magnitude. The information fusion between CNN and ViT was a unique contribution of this paper. Our deep models possessed distinct degrees of feature extraction ability and usability. One of our models (I3D-CLIP-TSAN) was capable of utilizing a better quality of features and confirmed a high separability between normal and abnormal snippets for VAED. The empirical results from several publicly available crowd datasets demonstrated the generalization ability and applicability of our models against the state-of-the-art approaches to the WVAED problem.

Fundamentally, our model is a natural extension of video classification based on pretrained feature extractors from CNN and ViT. ViT technology has been gaining great interest, and its utilization has spread broadly in computer vision. It is assumed that a ViT can better capture long-range contextual relationships in videos. We employed CLIP [26] as a ViT feature extractor, and other options, including VisualBERT [31], ViLBERT [32], and data-efficient CLIP [33], could be employed. Recently, the XD-Violence [10] dataset has become a common benchmark for WVAED [10,19,20]. However, we could not use the XD-Violence [10] dataset due to a nontechnical reason regarding its accessibility (e.g., not being approved by the Norwegian Data Protection Authority); in future, we wish to test our models with it.

Author Contributions
Conceptualization, M.H.S.; methodology, M.H.S.; software, M.H.S.; validation, M.H.S., L.J. and C.W.O.; formal analysis, M.H.S., L.J. and C.W.O.; investigation, M.H.S., L.J. and C.W.O.; resources, M.H.S., L.J. and C.W.O.; data curation, M.H.S., L.J. and C.W.O.; writing—original draft preparation, M.H.S., L.J. and C.W.O.; writing—review and editing, M.H.S., L.J. and C.W.O.; visualization, M.H.S.; supervision, L.J. and C.W.O.
All authors have read and agreed to the published version of the manuscript.

Funding
This work is a part of the AI4CITIZENS research project (number 320783) supported by the Research Council of Norway.

Institutional Review Board Statement
Not applicable.

Informed Consent Statement
Not applicable.

Data Availability Statement
The datasets used in this study are openly available and downloadable from http://mha.cs.umn.edu/proj_events.shtml#crowd, http://www.svcl.ucsd.edu/projects/anomaly/dataset.htm, www.cse.cuhk.edu.hk/leojia/projects/detectabnormal/dataset.html, https://svip-lab.github.io/dataset/campus_dataset.html, and https://webpages.charlotte.edu/cchen62/dataset.html, accessed on 28 March 2023. Those datasets, except http://mha.cs.umn.edu/proj_events.shtml#crowd, were approved by Sikt (Norwegian Agency for Shared Services in Education and Research) with the reference number 720663.

Conflicts of Interest
The authors declare no conflict of interest.

References
Liu, K.; Ma, H. Exploring Background-bias for Anomaly Detection in Surveillance Videos. In Proceedings of the International Conference on Multimedia (MM), Nice, France, 21–25 October 2019; pp. 1490–1499. [Google Scholar]Gong, D.; Liu, L.; Le, V.; Saha, B.; Mansour, M.R.; Venkatesh, S.; van den Hengel, A. Memorizing Normality to Detect Anomaly: Memory-Augmented Deep Autoencoder for Unsupervised Anomaly Detection. In Proceedings of the International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1705–1714. [Google Scholar]Zaheer, M.Z.; Mahmood, A.; Khan, M.H.; Segu, M.; Yu, F.; Lee, S.I. Generative Cooperative Learning for Unsupervised Video Anomaly Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 14724–14734. [Google Scholar]Sharif, M.; Jiao, L.; Omlin, C. Deep Crowd Anomaly Detection by Fusing Reconstruction and Prediction Networks. Electronics 2023, 12, 1517. [Google Scholar]Chandola, V.; Banerjee, A.; Kumar, V. Anomaly detection: A survey. ACM Comput. Surv. 2009, 41, 15. [Google Scholar]Zhong, J.X.; Li, N.; Kong, W.; Liu, S.; Li, T.H.; Li, G. Graph Convolutional Label Noise Cleaner: Train a Plug-And-Play Action Classifier for Anomaly Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 16–20 June 2019; pp. 1237–1246. [Google Scholar]Zaheer, M.Z.; Mahmood, A.; Astrid, M.; Lee, S. CLAWS: Clustering Assisted Weakly Supervised Learning with Normalcy Suppression for Anomalous Event Detection. In Proceedings of the European Conference Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Volume 12367, pp. 358–376. [Google Scholar]Sultani, W.; Chen, C.; Shah, M. Real-World Anomaly Detection in Surveillance Videos. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6479–6488. [Google Scholar]Zhang, J.; Qing, L.; Miao, J. Temporal Convolutional Network with Complementary Inner Bag Loss for Weakly Supervised Anomaly Detection. In Proceedings of the International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 4030–4034. [Google Scholar]Wu, P.; Liu, J.; Shi, Y.; Sun, Y.; Shao, F.; Wu, Z.; Yang, Z. Not only Look, But Also Listen: Learning Multimodal Violence Detection Under Weak Supervision. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Volume 12375, pp. 322–339.
[Google Scholar]Zhu, Y.; Newsam, S.D. Motion-Aware Feature for Improved Video Anomaly Detection. In Proceedings of the British Machine Vision Conference (BMVC), Cardiff, UK, 9–12 September 2019; p. 270. [Google Scholar]Lv, H.; Zhou, C.; Cui, Z.; Xu, C.; Li, Y.; Yang, J. Localizing Anomalies From Weakly-Labeled Videos. IEEE Trans. Image Process. 2021, 30, 4505–4515. [Google Scholar] [CrossRef]Purwanto, D.; Chen, Y.T.; Fang, W.H. Dance with Self-Attention: A New Look of Conditional Random Fields on Anomaly Detection in Videos. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 173–183. [Google Scholar]Thakare, K.V.; Sharma, N.; Dogra, D.P.; Choi, H.; Kim, I.J. A multi-stream deep neural network with late fuzzy fusion for real-world anomaly detection. Expert Syst. Appl. 2022, 201, 117030. [Google Scholar]Sapkota, H.; Yu, Q. Bayesian Nonparametric Submodular Video Partition for Robust Anomaly Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 3202–3211. [Google Scholar]Liu, Y.; Liu, J.; Ni, W.; Song, L. Abnormal Event Detection with Self-guiding Multi-instance Ranking Framework. In Proceedings of the International Joint Conference on Neural Networks, IJCNN 2022, Padua, Italy, 18–23 July 2022; pp. 1–7. [Google Scholar]Carbonneau, M.A.; Cheplygina, V.; Granger, E.; Gagnon, G. Multiple instance learning: A survey of problem characteristics and applications. Pattern Recognit. 2018, 77, 329–353. [Google Scholar]Liu, Y.; Yang, D.; Wang, Y.; Liu, J.; Song, L. Generalized Video Anomaly Event Detection: Systematic Taxonomy and Comparison of Deep Models. arXiv 2023, arXiv:2302.05087. [Google Scholar]Tian, Y.; Pang, G.; Chen, Y.; Singh, R.; Verjans, J.W.; Carneiro, G. Weakly-supervised Video Anomaly Detection with Robust Temporal Feature Magnitude Learning. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 4955–4966. [Google Scholar]Joo, H.K.; Vo, K.; Yamazaki, K.; Le, N. CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection. arXiv 2022, arXiv:2212.05136. [Google Scholar]Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D Convolutional Neural Networks for Human Action Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2013, 35, 221–231. [Google Scholar]Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733. [Google Scholar]Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. In Proceedings of the International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 2065–2074. [Google Scholar]Ho, V.K.V.; Truong, S.; Yamazaki, K.; Raj, B.; Tran, M.T.; Le, N. AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation. Int. J. Comput. Vis. 2023, 131, 302–323. [Google Scholar]Yamazaki, K.; Vo, K.; Truong, S.; Raj, B.; Le, N. VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning. arXiv 2022, arXiv:2211.15103. [Google Scholar] [CrossRef]Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 
Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning (ICML), Virtual, 18–24 July 2021; Volume 139, pp. 8748–8763. [Google Scholar]Tran, D.; Bourdev, L.D.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 4489–4497. [Google Scholar]Wang, L.; Xiong, Y.; Wang, Z.; Qiao, Y.; Lin, D.; Tang, X.; Gool, L.V. Temporal Segment Networks for Action Recognition in Videos. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2740–2755. [Google Scholar] [PubMed]Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.E.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.; Chang, K. VisualBERT: A Simple and Performant Baseline for Vision and Language. arXiv 2019, arXiv:1908.03557. [Google Scholar]Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 8–14 December 2019; pp. 13–23. [Google Scholar]Li, Y.; Liang, F.; Zhao, L.; Cui, Y.; Ouyang, W.; Shao, J.; Yu, F.; Yan, J. Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 25–29 April 2022. [Google Scholar]Li, S.; Liu, F.; Jiao, L. Self-Training Multi-Sequence Learning with Transformer for Weakly Supervised Video Anomaly Detection. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Conference on Innovative Applications of Artificial Intelligence (IAAI), Symposium on Educational Advances in Artificial Intelligence (EAAI), Virtual, 22 February–1 March 2022; pp. 1395–1403. [Google Scholar]Lv, H.; Yue, Z.; Sun, Q.; Luo, B.; Cui, Z.; Zhang, H. Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection. arXiv 2023, arXiv:2303.12369. [Google Scholar]Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. In Proceedings of the International Conference on Learning Representations (ICLR), Puerto Rico, PR, USA, 2–4 May 2016. [Google Scholar]Wang, X.; Girshick, R.B.; Gupta, A.; He, K. Non-Local Neural Networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7794–7803. [Google Scholar]University, M. Detection of Unusual Crowd Activities in Both Indoor and Outdoor Scenes. 2021. Available online: http://mha.cs.umn.edu/proj_events.shtml#crowd (accessed on 28 March 2023).He, C.; Shao, J.; Sun, J. An anomaly-introduced learning method for abnormal event detection. Multim. Tools Appl. 2018, 77, 29573–29588. [Google Scholar]Liu, W.; Luo, W.; Lian, D.; Gao, S. Future Frame Prediction for Anomaly Detection - A New Baseline. 
In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 6536–6545. [Google Scholar]Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 8024–8035. [Google Scholar]Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations, (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]Sharif, M.H. An Eigenvalue Approach to Detect Flows and Events in Crowd Videos. J. Circuits Syst. Comput. 2017, 26, 1750110. [Google Scholar] [CrossRef]Sharif, M.H.; Jiao, L.; Omlin, C.W. Deep Crowd Anomaly Detection: State-of-the-Art, Challenges, and Future Research Directions. arXiv 2022, arXiv:2210.13927. [Google Scholar]Rahman, Q.I.; Schmeisser, G. Characterization of the speed of convergence of the trapezoidal rule. Numer. Math. 1990, 57, 123–138. [Google Scholar] [CrossRef]Zaheer, M.Z.; Mahmood, A.; Shin, H.; Lee, S.I. A Self-Reasoning Framework for Anomaly Detection Using Video-Level Labels. IEEE Signal Process. Lett. 2020, 27, 1705–1709. [Google Scholar] [CrossRef]Wan, B.; Fang, Y.; Xia, X.; Mei, J. Weakly Supervised Video Anomaly Detection via Center-Guided Discriminative Learning. In Proceedings of the International Conference on Multimedia and Expo (ICME), London, UK, 6–10 July 2020; pp. 1–6. [Google Scholar]Majhi, S.; Das, S.; Brémond, F. DAM: Dissimilarity Attention Module for Weakly-supervised Video Anomaly Detection. In Proceedings of the International Conference on Advanced Video and Signal Based Surveillance (AVSS), Washington, DC, USA, 16–19 November 2021; pp. 1–8. [Google Scholar]Wu, P.; Liu, J. Learning Causal Temporal Relation and Feature Discrimination for Anomaly Detection. IEEE Trans. Image Process. 2021, 30, 3513–3527. [Google Scholar] [CrossRef]Yu, S.; Wang, C.; Ma, Q.; Li, Y.; Wu, J. Cross-Epoch Learning for Weakly Supervised Anomaly Detection in Surveillance Videos. IEEE Signal Process. Lett. 2021, 28, 2137–2141. [Google Scholar] [CrossRef]Feng, J.C.; Hong, F.T.; Zheng, W.S. MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 14009–14018. [Google Scholar]Zaheer, M.Z.; Mahmood, A.; Astrid, M.; Lee, S. Clustering Aided Weakly Supervised Training to Detect Anomalous Events in Surveillance Videos. arXiv 2022, arXiv:2203.13704. [Google Scholar] [CrossRef]Cao, C.; Zhang, X.; Zhang, S.; Wang, P.; Zhang, Y. Weakly Supervised Video Anomaly Detection Based on Cross-Batch Clustering Guidance. arXiv 2022, arXiv:2212.08506. [Google Scholar]Cao, C.; Zhang, X.; Zhang, S.; Wang, P.; Zhang, Y. Adaptive graph convolutional networks for weakly supervised anomaly detection in videos. arXiv 2022, arXiv:2202.06503. [Google Scholar] [CrossRef]Tan, W.; Yao, Q.; Liu, J. Overlooked Video Classification in Weakly Supervised Video Anomaly Detection. arXiv 2022, arXiv:2210.06688. [Google Scholar]Yi, S.; Fan, Z.; Wu, D. Batch feature standardization network with triplet loss for weakly-supervised video anomaly detection. Image Vis. Comput. 2022, 120, 104397. 
[Google Scholar] [CrossRef]Yu, S.; Wang, C.; Xiang, L.; Wu, J. TCA-VAD: Temporal Context Alignment Network for Weakly Supervised Video Anomly Detection. In Proceedings of the International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]Gong, Y.; Wang, C.; Dai, X.; Yu, S.; Xiang, L.; Wu, J. Multi-Scale Continuity-Aware Refinement Network for Weakly Supervised Video Anomaly Detection. In Proceedings of the International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, 18–22 July 2022; pp. 1–6. [Google Scholar]Majhi, S.; Dai, R.; Kong, Q.; Garattoni, L.; Francesca, G.; Bremond, F. Human-Scene Network: A Novel Baseline with Self-rectifying Loss for Weakly supervised Video Anomaly Detection. arXiv 2023, arXiv:2301.07923. [Google Scholar]Park, S.; Kim, H.; Kim, M.; Kim, D.; Sohn, K. Normality Guided Multiple Instance Learning for Weakly Supervised Video Anomaly Detection. In Proceedings of the Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 2664–2673. [Google Scholar]Pu, Y.; Wu, X.; Wang, S. Learning Prompt-Enhanced Context Features for Weakly-Supervised Video Anomaly Detection. arXiv 2023, arXiv:2306.14451. [Google Scholar]Sun, S.; Gong, X. Long-Short Temporal Co-Teaching for Weakly Supervised Video Anomaly Detection. arXiv 2023, arXiv:2303.18044. [Google Scholar]Wang, L.; Wang, X.; Liu, F.; Li, M.; Hao, X.; Zhao, N. Attention-guided MIL weakly supervised visual anomaly detection. Measurement 2023, 209, 112500. [Google Scholar] [CrossRef]Nemenyi, P. Distribution-Free Multiple Comparisons. Ph.D. Thesis, Princeton University, Princeton, NJ, USA, 1963. [Google Scholar]Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]Bousmina, A.; Selmi, M.; Ben Rhaiem, M.A.; Farah, I.R. A Hybrid Approach Based on GAN and CNN-LSTM for Aerial Activity Recognition. Remote Sens. 2023, 15, 3626. [Google Scholar] [CrossRef]Aksan, F.; Li, Y.; Suresh, V.; Janik, P. CNN-LSTM vs. LSTM-CNN to Predict Power Flow Direction: A Case Study of the High-Voltage Subnet of Northeast Germany. Sensors 2023, 23, 901. [Google Scholar] [CrossRef] [PubMed]Trinh, T.H.; Dai, A.M.; Luong, T.; Le, Q.V. Learning Longer-term Dependencies in RNNs with Auxiliary Losses. In Proceedings of the International Conference on Machine Learning (ICML), Stockholm, Sweden, 10–15 July 2018; Volume 80, pp. 4972–4981. [Google Scholar]Suzgun, M.; Belinkov, Y.; Shieber, S.M. On Evaluating the Generalization of LSTM Models in Formal Languages. In Proceedings of the Society for Computation in Linguistics (SCiL), New York, NY, USA, 3–6 January 2019; pp. 277–286. [Google Scholar]Nguyen, N.G.; Phan, D.; Lumbanraja, F.R.; Faisal, M.R.; Abapihi, B.; Purnama, B.; Delimayanti, M.K.; Mahmudah, K.R.; Kubo, M.; Satou, K. Applying Deep Learning Models to Mouse Behavior Recognition. J. Biomed. Sci. Eng. 2019, 12, 183–196. [Google Scholar] [CrossRef]Wang, X.; Miao, Z.; Zhang, R.; Hao, S. I3D-LSTM: A New Model for Human Action Recognition. In Proceedings of the International Conference on Advanced Materials, Intelligent Manufacturing and Automation (AMIMA), Zhuhai, China, 17–19 May 2019; pp. 1–6. [Google Scholar]Liu, G.; Zhang, C.; Xu, Q.; Cheng, R.; Song, Y.; Yuan, X.; Sun, J. I3D-Shufflenet Based Human Action Recognition. Algorithms 2020, 13, 301. [Google Scholar] [CrossRef]Obregon, D.F.; Navarro, J.L.; Santana, O.J.; Sosa, D.H.; Santana, M.C. 
Figure 1. Generalized architecture of our proposed CNN-ViT-TSAN framework.

Figure 2. Visualization of sample testing results with various datasets. Pink regions show the manually labeled abnormal events, while the yellow regions indicate the areas below the ROC curves.

Figure 3. Nemenyi [64] post hoc critical distance diagram with α = 0.05 using the 1 − AUC_o scores in Table 1 for the ShanghaiTech and UCF-Crime datasets.

Figure 4. Nemenyi [64] post hoc critical distance diagram with α = 0.10 using the 1 − AUC_o scores in Table 1 for the UCSD-Ped2, ShanghaiTech, and UCF-Crime datasets.

Figure 5. Numerical comparison of mean feature magnitudes and Mahalanobis distances. (a) Normal and abnormal frames cannot be distinguished using mean feature magnitudes. (b) Mahalanobis distances can, to some extent, separate normal frames from abnormal ones.

Table 1. Frame-level AUC_o score comparison of various weakly supervised methods and datasets. Column-wise, the best score is bolded and the second-best score is underlined in the published version. Each dataset cell reports AUC_o / (1 − AUC_o); "—" marks scores that are not available.

Year | Method | Feature | UCSD-Ped2 (AUC_o / 1 − AUC_o) | ShanghaiTech (AUC_o / 1 − AUC_o) | UCF-Crime (AUC_o / 1 − AUC_o)
---- | ------ | ------- | ----------------------------- | -------------------------------- | ------------------------------
2018 | Sultani et al. [8] | C3D | — / — | 0.8317 / 0.1683 | 0.7541 / 0.2459
2018 | Sultani et al. [8] | I3D | — / — | 0.8533 / 0.1467 | 0.7792 / 0.2208
2019 | Zhong et al. [6] | C3D | — / — | 0.7644 / 0.2356 | 0.8108 / 0.1892
2019 | Zhong et al. [6] | TSN | 0.9320 / 0.0680 | 0.8444 / 0.1556 | 0.8212 / 0.1788
2019 | Zhang et al. [9] | C3D | — / — | 0.8250 / 0.1750 | 0.7870 / 0.2130
2020 | Zaheer et al. [46] | C3D-self | 0.9447 / 0.0553 | 0.8416 / 0.1584 | 0.7954 / 0.2046
2020 | Zaheer et al. [7] | C3D | — / — | 0.8967 / 0.1033 | 0.8303 / 0.1697
2020 | Wan et al. [47] | I3D | — / — | 0.8538 / 0.1462 | 0.7896 / 0.2104
2021 | Purwanto et al. [13] | TRN | — / — | 0.9685 / 0.0315 | 0.8500 / 0.1500
2021 | Tian et al. [19] | C3D | — / — | 0.9151 / 0.0849 | 0.8328 / 0.1672
2021 | Majhi et al. [48] | I3D | — / — | 0.8822 / 0.1178 | 0.8267 / 0.1733
2021 | Tian et al. [19] | I3D | 0.9860 / 0.0140 | 0.9721 / 0.0279 | 0.8430 / 0.1570
2021 | Wu et al. [49] | I3D | — / — | 0.9748 / 0.0252 | 0.8489 / 0.1511
2021 | Yu et al. [50] | I3D | — / — | 0.8783 / 0.1217 | 0.8215 / 0.1785
2021 | Lv et al. [12] | I3D | — / — | 0.8530 / 0.1470 | 0.8538 / 0.1462
2021 | Feng et al. [51] | C3D | — / — | 0.9313 / 0.0687 | 0.8140 / 0.1860
2021 | Feng et al. [51] | I3D | — / — | 0.9483 / 0.0517 | 0.8230 / 0.1770
2022 | Zaheer et al. [3] | ResNext | — / — | 0.8621 / 0.1379 | 0.7984 / 0.2016
2022 | Zaheer et al. [52] | C3D | 0.9491 / 0.0509 | 0.9012 / 0.0988 | 0.8337 / 0.1663
2022 | Zaheer et al. [52] | 3DResNext | 0.9579 / 0.0421 | 0.9146 / 0.0854 | 0.8416 / 0.1584
2022 | Joo et al. [20] | C3D | — / — | 0.9719 / 0.0281 | 0.8394 / 0.1606
2022 | Joo et al. [20] | I3D | — / — | 0.9798 / 0.0202 | 0.8466 / 0.1534
2022 | Joo et al. [20] | CLIP | — / — | 0.9832 / 0.0168 | 0.8758 / 0.1242
2022 | Cao et al. [53] | I3D | — / — | 0.9645 / 0.0355 | 0.8587 / 0.1413
2022 | Li et al. [34] | I3D | — / — | 0.9608 / 0.0392 | 0.8530 / 0.1470
2022 | Cao et al. [54] | I3D-graph | — / — | 0.9605 / 0.0395 | 0.8467 / 0.1533
2022 | Tan et al. [55] | I3D | — / — | 0.9754 / 0.0246 | 0.8671 / 0.1329
2022 | Li et al. [34] | VideoSwin | — / — | 0.9732 / 0.0268 | 0.8562 / 0.1438
2022 | Yi et al. [56] | I3D | — / — | 0.9765 / 0.0235 | 0.8429 / 0.1571
2022 | Yu et al. [57] | C3D | — / — | 0.8835 / 0.1165 | 0.8208 / 0.1792
2022 | Yu et al. [57] | I3D | — / — | 0.8991 / 0.1009 | 0.8375 / 0.1625
2022 | Gong et al. [58] | I3D | — / — | 0.9010 / 0.0990 | 0.8100 / 0.1900
2023 | Majhi et al. [59] | I3D-Res | — / — | 0.9622 / 0.0378 | 0.8530 / 0.1470
2023 | Park et al. [60] | C3D | — / — | 0.9602 / 0.0398 | 0.8343 / 0.1657
2023 | Park et al. [60] | I3D | — / — | 0.9743 / 0.0257 | 0.8563 / 0.1437
2023 | Pu et al. [61] | I3D | — / — | 0.9814 / 0.0186 | 0.8676 / 0.1324
2023 | Lv et al. [35] | X-CLIP | — / — | 0.9678 / 0.0322 | 0.8675 / 0.1325
2023 | Sun et al. [62] | C3D | — / — | 0.9656 / 0.0344 | 0.8347 / 0.1653
2023 | Sun et al. [62] | I3D | — / — | 0.9792 / 0.0208 | 0.8588 / 0.1412
2023 | Wang et al. [63] | C3D | — / — | 0.9401 / 0.0599 | 0.8148 / 0.1852
2023 | C3D-TSAN (Ours) | C3D | 0.9675 / 0.0325 | 0.9608 / 0.0392 | 0.8578 / 0.1422
2023 | I3D-TSAN (Ours) | I3D | 0.9758 / 0.0242 | 0.9743 / 0.0257 | 0.8650 / 0.1350
2023 | CLIP-TSAN (Ours) | CLIP | 0.9811 / 0.0189 | 0.9806 / 0.0194 | 0.8763 / 0.1237
2023 | C3D-CLIP-TSAN (Ours) | C3D+CLIP | 0.9824 / 0.0176 | 0.9813 / 0.0187 | 0.8802 / 0.1198
2023 | I3D-CLIP-TSAN (Ours) | I3D+CLIP | 0.9839 / 0.0161 | 0.9866 / 0.0134 | 0.8897 / 0.1103
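To make the post hoc analysis behind Figures 3 and 4 easier to reproduce, the sketch below shows one way to compute average ranks and the Nemenyi [64] critical distance from 1 − AUC_o scores such as those in Table 1. It is a minimal Python illustration under stated assumptions, not the exact evaluation code of this work: the subset of methods, the two-dataset score matrix, and the hard-coded q_alpha value (the tabulated Nemenyi critical value for three methods at α = 0.05) are chosen only for the example.

```python
import numpy as np
from scipy.stats import rankdata

# Illustrative 1 - AUC_o scores (lower is better) for three methods on two datasets,
# taken from Table 1 (ShanghaiTech, UCF-Crime); the selection is only an example.
scores = {
    "Tian et al. [19] (I3D)": [0.0279, 0.1570],
    "Joo et al. [20] (CLIP)": [0.0168, 0.1242],
    "I3D-CLIP-TSAN (Ours)":   [0.0134, 0.1103],
}

names = list(scores)
matrix = np.array([scores[n] for n in names]).T   # shape: (N datasets, k methods)
N, k = matrix.shape

# Rank the methods within each dataset (rank 1 = smallest 1 - AUC_o, i.e., best),
# then average the ranks over the datasets.
avg_ranks = np.mean([rankdata(row) for row in matrix], axis=0)

# Nemenyi critical distance: CD = q_alpha * sqrt(k * (k + 1) / (6 * N)).
# q_alpha = 2.343 is the tabulated value for k = 3 methods at alpha = 0.05 (assumption).
q_alpha = 2.343
cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))

for name, r in zip(names, avg_ranks):
    print(f"{name}: average rank {r:.2f}")
print(f"Critical distance (alpha = 0.05): {cd:.3f}")
```

In a critical distance diagram, two methods are grouped together (not significantly different) whenever their average ranks differ by less than this critical distance.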
Table 2. Ablation study of the Mahalanobis metric on various datasets. Column-wise, the best score is bolded and the second-best score is underlined in the published version. Each dataset cell reports AUC_o / Gain.

Feature | Mahalanobis metric included? | UMN (AUC_o / Gain) | UCSD-Ped1 (AUC_o / Gain) | UCSD-Ped2 (AUC_o / Gain) | ShanghaiTech (AUC_o / Gain) | UCF-Crime (AUC_o / Gain)
------- | ---------------------------- | ------------------ | ------------------------ | ------------------------ | --------------------------- | ------------------------
C3D | No | 0.9136 / 1.00 | 0.8553 / 1.00 | 0.9214 / 1.00 | 0.9129 / 1.00 | 0.8262 / 1.00
C3D | Yes | 0.9517 / 4.17% | 0.8996 / 5.18% | 0.9675 / 4.99% | 0.9608 / 5.25% | 0.8578 / 3.82%
I3D | No | 0.9362 / 1.00 | 0.8903 / 1.00 | 0.9489 / 1.00 | 0.9359 / 1.00 | 0.8401 / 1.00
I3D | Yes | 0.9644 / 3.01% | 0.9085 / 2.04% | 0.9758 / 2.83% | 0.9743 / 4.09% | 0.8650 / 2.96%
CLIP | No | 0.9417 / 1.00 | 0.9063 / 1.00 | 0.9597 / 1.00 | 0.9391 / 1.00 | 0.8346 / 1.00
CLIP | Yes | 0.9731 / 3.33% | 0.9274 / 2.33% | 0.9811 / 2.23% | 0.9806 / 4.42% | 0.8763 / 4.99%
C3D+CLIP | No | 0.9405 / 1.00 | 0.8871 / 1.00 | 0.9396 / 1.00 | 0.9422 / 1.00 | 0.8348 / 1.00
C3D+CLIP | Yes | 0.9876 / 5.01% | 0.9315 / 5.01% | 0.9824 / 4.56% | 0.9813 / 4.15% | 0.8812 / 5.56%
I3D+CLIP | No | 0.9461 / 1.00 | 0.8943 / 1.00 | 0.9448 / 1.00 | 0.9400 / 1.00 | 0.8462 / 1.00
I3D+CLIP | Yes | 0.9903 / 4.67% | 0.9402 / 5.13% | 0.9839 / 4.14% | 0.9866 / 4.96% | 0.8897 / 5.14%
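The comparison summarized in Figure 5 and ablated in Table 2 can be sketched in a few lines of NumPy: segment features are scored either by their mean absolute magnitude or by their Mahalanobis distance to statistics estimated from normal training features. The snippet below is a simplified sketch rather than the paper's implementation; the randomly generated feature arrays, the feature dimensionality, and the small ridge term added to the covariance are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for extracted segment features (e.g., C3D/I3D/CLIP embeddings):
# rows are segments, columns are feature dimensions. Real features would be loaded from disk.
normal_feats = rng.normal(0.0, 1.0, size=(500, 64))   # features of normal training segments
test_feats   = rng.normal(0.5, 1.2, size=(10, 64))    # features of segments to score

# Estimate mean and covariance from the normal features only.
mu = normal_feats.mean(axis=0)
cov = np.cov(normal_feats, rowvar=False)
cov += 1e-6 * np.eye(cov.shape[0])                     # small ridge term for numerical stability
cov_inv = np.linalg.inv(cov)

# Mean feature magnitude (as in Figure 5a) versus Mahalanobis distance (as in Figure 5b).
mean_magnitude = np.abs(test_feats).mean(axis=1)
diff = test_feats - mu
mahalanobis = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

for i, (m, d) in enumerate(zip(mean_magnitude, mahalanobis)):
    print(f"segment {i}: mean |feature| = {m:.3f}, Mahalanobis distance = {d:.3f}")
```

Larger Mahalanobis distances indicate segments whose features deviate more from the normal-feature statistics, which is the property exploited in the ablation of Table 2.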

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Sharif, M.H.; Jiao, L.; Omlin, C.W. CNN-ViT Supported Weakly-Supervised Video Segment Level Anomaly Detection. Sensors 2023, 23, 7734. https://doi.org/10.3390/s23187734

AMA Style

Sharif MH, Jiao L, Omlin CW. CNN-ViT Supported Weakly-Supervised Video Segment Level Anomaly Detection. Sensors. 2023; 23(18):7734. https://doi.org/10.3390/s23187734

Chicago/Turabian Style

Sharif, Md. Haidar, Lei Jiao, and Christian W. Omlin. 2023. "CNN-ViT Supported Weakly-Supervised Video Segment Level Anomaly Detection" Sensors 23, no. 18: 7734. https://doi.org/10.3390/s23187734

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
